Sqoop Merge

Sqoop Merge is a tool that combines two datasets: entries in one dataset override the entries of the older dataset. Sqoop itself is used for efficiently transferring large volumes of data between Hadoop and structured data stores such as relational databases. After performing the merge operation, we can import the merged data into Apache Hive or HBase.

In short, the Sqoop merge tool "flattens" the two datasets into one by taking the newest available record for each primary key.


Sqoop Merge Syntax 
sqoop merge (generic-args) (merge-args)

Merge Arguments

  • --class-name <class> : Specify the name of the record-specific class to use during the merge job.
  • --jar-file <file> : Specify the name of the jar to load the record class from.
  • --merge-key <col> : Specify the name of the column to use as the merge key.
  • --new-data <path> : Specify the path of the newer dataset.
  • --onto <path> : Specify the path of the older dataset.
  • --target-dir <path> : Specify the target path for the output of the merge job.

  • The Sqoop merge tool runs a MapReduce job that takes two directories as input: one holding the newer dataset and one holding the older dataset. These two directories are specified with the --new-data and --onto arguments, respectively.
  • The output generated by this MapReduce job is placed in the HDFS directory specified by --target-dir.
  • While merging the datasets, it is assumed that each record has a unique primary key value.
  • We specify the primary key column with the --merge-key argument. More than one row in the same dataset must not have the same primary key; otherwise, data may be lost.
  • To parse the dataset and extract the key column, the auto-generated class from the previous import must be used.
  • We can specify the class name and the jar file with the --class-name and --jar-file arguments. If the class is not available, we can regenerate it with the Sqoop Codegen tool.
  • The Sqoop merge tool typically runs after an incremental import in last-modified mode, that is, sqoop import --incremental lastmodified ….
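The merge semantics described above — the newest record wins for each value of the merge key — can be sketched in plain Python. This is a simplified illustration, not Sqoop's actual MapReduce implementation, and the record fields ("id", "name") are hypothetical:

```python
def merge_datasets(older, newer, merge_key):
    """Combine two lists of record dicts: records in `newer`
    override records in `older` that share the same merge key."""
    merged = {}
    # Load the older dataset first (analogous to the --onto directory)...
    for record in older:
        merged[record[merge_key]] = record
    # ...then overwrite key by key with the newer dataset (--new-data).
    for record in newer:
        merged[record[merge_key]] = record
    return list(merged.values())

# Hypothetical example where "id" plays the role of --merge-key.
older = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
newer = [{"id": 2, "name": "bob-updated"}, {"id": 3, "name": "carol"}]
print(merge_datasets(older, newer, "id"))
```

Note that, as in Sqoop, the merge key must be unique within each dataset; duplicate keys in the same input would silently drop all but the last record.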


sqoop merge --new-data newer --onto older --target-dir merged --jar-file datatypes.jar --class-name Foo --merge-key id


